273 research outputs found
Recommended from our members
Improving the Power of GWAS and Avoiding Confounding from Population Stratification with PC-Select
Using a reduced subset of SNPs in a linear mixed model can improve power for genome-wide association studies, yet this can result in insufficient correction for population stratification. We propose a hybrid approach using principal components that does not inflate statistics in the presence of population stratification and improves power over standard linear mixed models
Population Structure and Eigenanalysis
Current methods for inferring population structure from genetic data do not provide formal significance tests for population differentiation. We discuss an approach to studying population structure (principal components analysis) that was first applied to genetic data by Cavalli-Sforza and colleagues. We place the method on a solid statistical footing, using results from modern statistics to develop formal significance tests. We also uncover a general āphase changeā phenomenon about the ability to detect structure in genetic data, which emerges from the statistical theory we use, and has an important implication for the ability to discover structure in genetic data: for a fixed but large dataset size, divergence between two populations (as measured, for example, by a statistic like F(ST)) below a threshold is essentially undetectable, but a little above threshold, detection will be easy. This means that we can predict the dataset size needed to detect structure
Recommended from our members
Identifying Repeat Domains in Large Genomes
We present a graph-based method for the analysis of repeat families in a repeat library. We build a repeat domain graph that decomposes a repeat library into repeat domains, short subsequences shared by multiple repeat families, and reveals the mosaic structure of repeat families. Our method recovers documented mosaic repeat structures and suggests additional putative ones. Our method is useful for elucidating the evolutionary history of repeats and annotating de novo generated repeat libraries
Explicit Modeling of Ancestry Improves Polygenic Risk Scores and BLUP Prediction
Polygenic prediction using genome-wide SNPs can provide high prediction accuracy for complex traits. Here, we investigate the question of how to account for genetic ancestry when conducting polygenic prediction. We show that the accuracy of polygenic prediction in structured populations may be partly due to genetic ancestry. However, we hypothesized that explicitly modeling ancestry could improve polygenic prediction accuracy. We analyzed three GWAS of hair color (HC), tanning ability (TA), and basal cell carcinoma (BCC) in European Americans (sample size from 7,440 to 9,822) and considered two widely used polygenic prediction approaches: polygenic risk scores (PRSs) and best linear unbiased prediction (BLUP). We compared polygenic prediction without correction for ancestry to polygenic prediction with ancestry as a separate component in the model. In 10-fold cross-validation using the PRS approach, the R(2) for HC increased by 66% (0.0456-0.0755; P < 10(-16)), the R(2) for TA increased by 123% (0.0154 to 0.0344; P < 10(-16)), and the liability-scale R(2) for BCC increased by 68% (0.0138-0.0232; P < 10(-16)) when explicitly modeling ancestry, which prevents ancestry effects from entering into each SNP effect and being overweighted. Surprisingly, explicitly modeling ancestry produces a similar improvement when using the BLUP approach, which fits all SNPs simultaneously in a single variance component and causes ancestry to be underweighted. We validate our findings via simulations, which show that the differences in prediction accuracy will increase in magnitude as sample sizes increase. In summary, our results show that explicitly modeling ancestry can be important in both PRS and BLUP prediction
Recommended from our members
Fast and accurate long-range phasing in a UK Biobank cohort
Recent work has leveraged the extensive genotyping of the Icelandic population to perform long-range phasing (LRP), enabling accurate imputation and association analysis of rare variants in target samples typed on genotyping arrays. Here, we develop a fast and accurate LRP method, Eagle, that extends this paradigm to populations with much smaller proportions of genotyped samples by harnessing long (>4cM) identical-by-descent (IBD) tracts shared among distantly related individuals. We applied Eagle to Nā150,000 samples (0.2% of the British population) from the UK Biobank, and we determined that it is 1ā2 orders of magnitude faster than existing methods while achieving similar or better phasing accuracy (switch error rate ā0.3%, corresponding to perfect phase in a majority of 10Mb segments). We also observed that when used within an imputation pipeline, Eagle pre-phasing improved downstream imputation accuracy compared to pre-phasing in batches using existing methods (as necessary to achieve comparable computational cost)
Progress and promise in understanding the genetic basis of common diseases
Susceptibility to common human diseases is influenced by both genetic and environmental factors. The explosive growth of genetic data, and the knowledge that it is generating, are transforming our biological understanding of these diseases. In this review, we describe the technological and analytical advances that have enabled genome-wide association studies to be successful in identifying a large number of genetic variants robustly associated with common disease. We examine the biological insights that these genetic associations are beginning to produce, from functional mechanisms involving individual genes to biological pathways linking associated genes, and the identification of functional annotations, some of which are cell-type-specific, enriched in disease associations. Although most efforts have focused on identifying and interpreting genetic variants that are irrefutably associated with disease, it is increasingly clear thatāeven at large sample sizesāthese represent only the tip of the iceberg of genetic signal, motivating polygenic analyses that consider the effects of genetic variants throughout the genome, including modest effects that are not individually statistically significant. As data from an increasingly large number of diseases and traits are analysed, pleiotropic effects (defined as genetic loci affecting multiple phenotypes) can help integrate our biological understanding. Looking forward, the next generation of population-scale data resources, linking genomic information with health outcomes, will lead to another step-change in our ability to understand, and treat, common diseases
- ā¦